Algorithms for bigram and trigram word clustering
Authors
Sven Martin, Jörg Liermann, Hermann Ney
Lehrstuhl für Informatik VI, RWTH Aachen, University of Technology, D-52056 Aachen, Germany
Abstract
This paper presents and analyzes improved algorithms for clustering bigram and trigram word equivalence classes, together with their results: 1) We give a detailed time complexity analysis of bigram clustering algorithms. 2) We present an improved implementation of bigram clustering so that large corpora (38 million words and more) can be clustered within a small number of days or even hours. 3) We extend the clustering approach from bigrams to trigrams. 4) We present experimental results on a 38 million word training corpus.
Similar resources
Enhanced word classing for model M
Model M is a superior class-based n-gram model that has shown improvements on a variety of tasks and domains. In previous work with Model M, bigram mutual information clustering has been used to derive word classes. In this paper, we introduce a new word classing method designed to closely match with Model M. The proposed classing technique achieves gains in speech recognition word-error rate o...
New Developments in Lattice-Based Search Strategies in SRI’s Hub4 System
We describe new developments in SRI’s lattice-based progressive search strategy. These developments include the implementation of a new bigram lattice algorithm, lattice optimization techniques, and expansion of bigram lattices to trigram lattices. The new bigram lattice generation algorithm is based on generation of backtrace entries using a word-dependent N-best list decoding pass, followed b...
Building and Incorporating Language Models for Persian Continuous Speech Recognition Systems
This paper describes building statistical language models for Persian from a corpus and incorporating them into a Persian continuous speech recognition (CSR) system. We used the Persian Text Corpus to build the language models. First, we preprocessed the corpus texts by correcting variant orthographies of words. The number of POS tags was also reduced by clustering POS tag...
Efficient lattice representation and generation
In large-vocabulary, multi-pass speech recognition systems, it is desirable to generate word lattices incorporating a large number of hypotheses while keeping the lattice sizes small. We describe two new techniques for reducing word lattice sizes without eliminating hypotheses. The first technique is an algorithm to reduce the size of non-deterministic bigram word lattices. The algorithm iterat...
Class phrase models for language modelling
Previous attempts to automatically determine multi-words as the basic unit for language modeling have been successful for extending bigram models [10, 9, 2, 8] to improve the perplexity of the language model and/or the word accuracy of the speech decoder. However, none of these techniques gave improvements over the trigram model so far, except for the rather controlled ATIS task [8]. We therefor...
Journal:
- Speech Communication
Volume 24, issue
Pages -
Publication date 1995